A Corpus of Spontaneous Speech in Lectures: The KIT Lecture Corpus for Spoken Language Processing and Translation

Authors

Eunah Cho

Sarah Fünfer

Sebastian Stüker

Alexander H. Waibel

Abstract

With the increasing number of applications handling spontaneous speech, the need to process spoken language is growing. Handling speech disfluencies is one of the most challenging tasks in automatic speech processing. Because most applications are trained on well-formed written text, many issues arise when they process spontaneous speech, owing to its distinctive characteristics. More data with annotated speech disfluencies will therefore help adapt natural language processing applications, such as machine translation systems. To support this, we have annotated speech disfluencies in German lecture data collected at KIT. In this paper we describe how we annotated the disfluencies in the data and provide detailed statistics on the size of the corpus and on the speakers. Moreover, we compare machine translation performance on a source text that includes disfluencies with performance on a source text with particular types of disfluencies removed, or with all disfluencies removed.

Similar resources

The KIT Lecture Corpus for Speech Translation

Academic lectures offer valuable content, but often do not reach their full potential audience due to the language barrier. Human translations of lectures are too expensive to be widely used. Speech translation technology can be an affordable alternative in this case. State-of-the-art spoken language translation systems utilize statistical models that need to be trained on large amounts of in-d...

Efficient Access to Lecture Audio Archives through Spoken Language Processing

The paper first addresses the current state of speech recognition using the “Corpus of Spontaneous Japanese (CSJ)”. It is shown that the large-scale corpus had a strong impact on training acoustic and language models that account for the morphological and pronunciation variations characteristic of spontaneous Japanese. Unsupervised adaptation of these models and the speaking rate is also effec...

The A'lam Corpus: A Standard Named Entity Corpus for the Persian Language

Named entity recognition (NER) is a natural language processing (NLP) problem that is mainly used in text summarization, data mining, data retrieval, question answering, machine translation, and document classification systems. A NER system is tasked with determining the boundaries of each named entity, recognizing its type, and classifying it into predefined categories. The categories of named...

A Comparative Study of Metadiscourse Markers in English and Persian University Lectures

The purpose of this study was to compare metadiscourse markers in forty English and Persian university lectures. Twenty of them were selected from the British Academic Spoken English corpus. The other 20 were selected from an Iranian website (www.maktoobkhane.com). We used Hyland’s (2005) model of metadiscourse. The metadiscourses were collected. Further, the frequency of each type was studied....

Spontaneous Speech in the Spoken Dutch Corpus

In this paper the Spoken Dutch Corpus project is presented, a joint Flemish-Dutch undertaking aimed at the compilation and annotation of a corpus of 1,000 hours of spoken Dutch. Upon completion, the corpus will constitute a valuable resource for research in the fields of (computational) linguistics and language and speech technology. Although the corpus will contain a fair amount of read speech...

Publication date: 2014
